NML-optimal Histogram Density Estimation

Authors

  • Petri Kontkanen
  • Petri Myllymäki
Abstract

Density estimation is one of the central problems in statistical inference and machine learning. Given a sample of observations, the goal of histogram density estimation is to find a piecewise-constant density that describes the data best according to some pre-determined criterion. Although histograms are conceptually simple densities, they are very flexible and can model complex properties like multi-modality with a relatively small number of parameters. Furthermore, one does not need to assume any specific form for the underlying density function: given enough bins, a histogram estimator adapts to any kind of density.

Most existing methods for learning histogram densities assume that the bin widths are equal and concentrate only on finding the optimal bin count. These regular histograms are, however, often problematic. It has been argued [2] that regular histograms are only good for describing roughly uniform data. If the data distribution is strongly non-uniform, the bin count must necessarily be high if one wants to capture the details of the high-density portion of the data. This in turn means that an unnecessarily large number of bins is wasted in the low-density regions. To avoid these problems, one must allow the bins to be of variable width. For these irregular histograms, it is necessary to find the optimal set of cut points in addition to the number of bins, which makes the learning problem considerably more difficult.

To solve this problem, we regard histogram density estimation as a model selection task, where the cut point sets are considered as models. In this framework, one must first choose a set of candidate cut points, from which the optimal model is then searched for. The quality of each cut point set is measured by some model selection criterion. Our approach is based on information theory, more specifically on the minimum description length (MDL) principle developed in the series of papers [3, 4, 5].
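To make the setting concrete: for a fixed set of cut points, the maximum-likelihood piecewise-constant density gives each bin the height (h_k/n)/w_k, where h_k is the number of observations in bin k, w_k its width, and n the sample size. A minimal sketch (an illustration of irregular histograms in general, not the authors' learning algorithm; the function name is hypothetical):

```python
from bisect import bisect_right

def irregular_histogram(data, cut_points):
    """ML piecewise-constant density for given (possibly unequal) cut points.

    Returns per-bin counts h_k and density heights (h_k / n) / w_k.
    Points outside [cut_points[0], cut_points[-1]] are ignored.
    """
    edges = sorted(cut_points)
    K = len(edges) - 1  # number of bins
    counts = [0] * K
    for x in data:
        if edges[0] <= x <= edges[-1]:
            # locate the bin; clamp so the last right edge is closed
            k = min(bisect_right(edges, x) - 1, K - 1)
            counts[k] += 1
    n = sum(counts)
    widths = [edges[k + 1] - edges[k] for k in range(K)]
    density = [counts[k] / (n * widths[k]) for k in range(K)]
    return counts, density

# Narrow bins near a cluster, a wide bin over the sparse tail:
data = [0.1, 0.15, 0.2, 0.8, 2.5]
cuts = [0.0, 0.25, 1.0, 3.0]
counts, density = irregular_histogram(data, cuts)  # counts == [3, 1, 1]
```

By construction the heights integrate to one over the binned range; the model selection problem discussed in the abstract is precisely the choice of `cuts`.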
MDL is a well-founded, general framework for performing model selection and other types of statistical inference. The fundamental idea behind the MDL principle is that any regularity in data can be used to compress the data, i.e., to find a description or code for it that uses fewer symbols than any other code, and fewer than it takes to describe the data literally. The more regularities there are, the more the data can be compressed. According to the MDL principle, learning can thus be equated with finding regularities in data: the more we are able to compress the data, the more we have learned about it. Model selection with MDL is done by minimizing a quantity called the stochastic complexity, which is the shortest description length of the data relative to a given model class. The definition of the stochastic complexity is based on the normalized maximum likelihood (NML) distribution introduced in [6, 5]. The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model selection. It was originally [5] formulated as the unique solution to the minimax problem presented in [6], which implies that NML is the minimax-optimal universal model. Later [7], it was shown that NML is also the solution to a related problem involving expected regret. See [8, 9] for more discussion on the theoretical properties of NML.
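For a histogram with K fixed bins, the count vector follows a K-category multinomial, and the stochastic complexity takes the form SC = -log P(data | ML parameters) + log C(K, n), where the parametric complexity is C(K, n) = Σ_{h_1+…+h_K=n} n!/(h_1!…h_K!) Π_k (h_k/n)^{h_k}. A brute-force sketch of these two quantities (exponential-time, for illustration only; function names are hypothetical, and the irregular-histogram case in the paper additionally involves bin-width terms):

```python
from itertools import count  # unused stdlib noise avoided; only math is needed
from math import comb, log

def multinomial_complexity(K, n):
    """Parametric complexity C(K, n) of the K-category multinomial,
    summed by brute force over all count vectors h_1 + ... + h_K = n.
    (Real implementations use efficient recurrences instead.)"""
    def compositions(n, K):
        if K == 1:
            yield (n,)
            return
        for h in range(n + 1):
            for rest in compositions(n - h, K - 1):
                yield (h,) + rest

    total = 0.0
    for counts in compositions(n, K):
        coef, rem = 1, n
        for h in counts:          # multinomial coefficient n!/(h_1!...h_K!)
            coef *= comb(rem, h)
            rem -= h
        lik = 1.0
        for h in counts:          # ML likelihood of this count vector
            if h:
                lik *= (h / n) ** h
        total += coef * lik
    return total

def stochastic_complexity(counts):
    """SC in nats: -log ML likelihood plus log parametric complexity."""
    n, K = sum(counts), len(counts)
    neg_log_ml = -sum(h * log(h / n) for h in counts if h)
    return neg_log_ml + log(multinomial_complexity(K, n))
```

Minimizing this quantity over candidate cut point sets penalizes models with many bins through the log C(K, n) term, which is how MDL avoids overfitting without an ad hoc regularizer.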


Related articles

Information-Theoretically Optimal Histogram Density Estimation

We regard histogram density estimation as a model selection problem. Our approach is based on the information-theoretic minimum description length (MDL) principle. MDL-based model selection is formalized via the normalized maximum likelihood (NML) distribution, which has several desirable optimality properties. We show how this approach can be applied for learning generic, irregular (variable-wi...


MDL Histogram Density Estimation

We regard histogram density estimation as a model selection problem. Our approach is based on the information-theoretic minimum description length (MDL) principle, which can be applied for tasks such as data clustering, density estimation, image denoising and model selection in general. MDL-based model selection is formalized via the normalized maximum likelihood (NML) distribution, which has se...


Computationally Efficient Methods for MDL-Optimal Density Estimation and Data Clustering

The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. T...


Bin width selection in multivariate histograms by the combinatorial method

We present several multivariate histogram density estimates that are universally L1-optimal to within a constant factor and an additive term O(√(log n)/n). The bin widths are chosen by the combinatorial method developed by the authors in Combinatorial Methods in Density Estimation (Springer-Verlag, 2001). The present paper solves a problem left open in that book.


Improving Accuracy and Efficiency of Mutual Information for Multi-modal Retinal Image Registration using Adaptive Probability Density Estimation

Mutual Information (MI) is a popular similarity measure for performing image registration between different modalities. MI makes a statistical comparison between two images by computing the entropy from the probability distribution of the data. Therefore, to obtain an accurate registration it is important to have an accurate estimation of the true underlying probability distribution. Within the ...



Journal:

Volume   Issue

Pages  -

Publication year: 2008